BinaryBERT: Pushing the Limit of BERT Quantization


FIGURE 5.11
Visualization of the loss landscapes of the full-precision, ternary, and binary models on MRPC [230].

where $x \in \{\pm 0.2\,\bar{W}_1, \pm 0.4\,\bar{W}_1, \ldots, \pm 1.0\,\bar{W}_1\}$ are perturbation magnitudes based on the absolute mean value $\bar{W}_1$ of $W_1$, and similar rules hold for $y$; $\mathbf{1}_x$ and $\mathbf{1}_y$ are vectors with all elements being 1. For each pair $(x, y)$, the corresponding training loss is shown in Fig. 5.11. As can be seen, the full-precision model has the lowest overall training loss, and its loss landscape is flat and robust to the perturbations. For the ternary model, although the surface tilts up with larger perturbations, it remains locally convex and is thus easy to optimize. This may also explain why BERT models can be ternarized without a severe accuracy drop [285]. The loss landscape of the binary model, however, turns out to be both higher and more complex. When the three landscapes are stacked together, the loss surface of the binary BERT sits on top with a clear margin over the other two. The steep curvature of its loss surface reflects a higher sensitivity to binarization, which accounts for the training difficulty.
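To make the procedure concrete, the sketch below sweeps a grid of constant perturbations over two weight tensors and records the training loss for each pair $(x, y)$, which is enough to reproduce a landscape like Fig. 5.11. The `loss_fn(model, batch)` helper, the choice of the two tensors passed as `W1` and `W2`, and the step ratios are illustrative assumptions; the paper's exact setup follows the definitions above.

```python
import torch

@torch.no_grad()
def loss_landscape_grid(model, loss_fn, batch, W1, W2,
                        ratios=(-1.0, -0.8, -0.6, -0.4, -0.2, 0.2, 0.4, 0.6, 0.8, 1.0)):
    """Training loss on a grid of constant shifts added to two weight tensors.

    For each pair (x, y), W1 is shifted by x * 1 and W2 by y * 1, where x and y
    are multiples of each tensor's absolute mean value; the weights are restored
    after every evaluation.  Which tensors play the role of W1 / W2 is assumed
    to follow the setup described in the text.
    """
    w1_orig, w2_orig = W1.detach().clone(), W2.detach().clone()
    w1_bar, w2_bar = W1.abs().mean(), W2.abs().mean()   # absolute mean magnitudes

    landscape = {}
    for rx in ratios:
        for ry in ratios:
            W1.copy_(w1_orig + rx * w1_bar)   # x * 1_x: same shift on every element
            W2.copy_(w2_orig + ry * w2_bar)   # y * 1_y
            landscape[(rx, ry)] = loss_fn(model, batch).item()

    W1.copy_(w1_orig)                          # restore the original weights
    W2.copy_(w2_orig)
    return landscape
```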

The authors further quantitatively measured the steepness of the loss landscape, starting from a local minimum $W$ and applying a second-order approximation to the curvature. According to Taylor's expansion, the loss increase induced by quantizing $W$ can be approximately upper bounded by

$$\mathcal{L}(\hat{W}) - \mathcal{L}(W) \approx \epsilon^\top H \epsilon \le \lambda_{\max} \|\epsilon\|_2^2, \qquad (5.19)$$

where $\epsilon = \hat{W} - W$ is the quantization noise, and $\lambda_{\max}$ is the largest eigenvalue of the Hessian $H$ at $W$. Note that the first-order term is skipped because $\nabla\mathcal{L}(W) = 0$ at the local minimum. By taking

$\lambda_{\max}$ [208] as a quantitative measurement of the steepness of the loss surface, the authors calculated $\lambda_{\max}$ separately for each part of BERT: (1) the query/key layers (MHA-QK), (2) the value layer (MHA-V), and (3) the output projection layer (MHA-O) in the multi-head attention, and (4) the intermediate layer (FFN-Mid) and (5) the output layer (FFN-Out) in the feed-forward network. From Fig. 5.12, the top-1 eigenvalues of the binary model are higher

FIGURE 5.12
The top-1 eigenvalues of parameters at different Transformer parts of the full-precision (FP), ternary, and binary BERT.
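As a rough sketch of how $\lambda_{\max}$ in (5.19) can be estimated in practice, the snippet below runs power iteration on Hessian-vector products restricted to the parameters of a single Transformer part. The parameter-name substrings (HuggingFace-style), the `loss_fn(model, batch)` helper, and the iteration settings are illustrative assumptions rather than the authors' exact procedure, which follows [208].

```python
import torch

def top_hessian_eigenvalue(model, loss_fn, batch,
                           part_keys=("attention.self.query", "attention.self.key"),
                           iters=50, tol=1e-3):
    """Estimate the top Hessian eigenvalue for one part of the model.

    Power iteration on Hessian-vector products (double backpropagation),
    restricted to parameters whose names contain one of `part_keys`
    (here MHA-QK, assuming HuggingFace-style parameter names).
    """
    params = [p for n, p in model.named_parameters()
              if p.requires_grad and any(k in n for k in part_keys)]

    loss = loss_fn(model, batch)
    grads = torch.autograd.grad(loss, params, create_graph=True)

    # Random unit-norm starting direction over the selected parameters.
    v = [torch.randn_like(p) for p in params]
    norm = torch.sqrt(sum((x * x).sum() for x in v))
    v = [x / norm for x in v]

    eig = 0.0
    for _ in range(iters):
        gv = sum((g * x).sum() for g, x in zip(grads, v))            # grad . v
        hv = torch.autograd.grad(gv, params, retain_graph=True)      # H v
        new_eig = sum((h * x).sum() for h, x in zip(hv, v)).item()   # Rayleigh quotient v^T H v
        norm = torch.sqrt(sum((h * h).sum() for h in hv))
        v = [h / (norm + 1e-12) for h in hv]
        if abs(new_eig - eig) <= tol * max(abs(new_eig), 1e-12):
            return new_eig
        eig = new_eig
    return eig
```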